Understanding How Audit Factors Affect Case Sales
Background and Introduction
The data on the following sheet is anonymized sample data: a combination of case sales, machines present, and other audit data that we’ve collected at the store level.
1.) Our key metric is asset utilization. We want to know how well each of our machines is performing from a case sales perspective. Poorly performing machines are not profitable.
2.) It is normal to see a fractional machine number. We give our accounts the benefit of taking into account when a machine is placed. A whole number indicates that the machine has been there for a full 12 months, while a fractional value reflects a partial year (e.g., a machine value of 0.2 indicates that the machine has only been at that location for a few months).
3.) Our business model is to sell cases of product to a distributor who in turn sells it to a store. We maintain a relationship with each store because we place a Frazil-owned machine at each location. You will see multiple stores roll up to a single distributor.
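The proration rule in point 2 can be sketched as follows. This is a hypothetical implementation (the exact crediting rule is not given with the data): it simply credits months in place out of a 12-month year, capped at one full machine.

```r
# Hypothetical sketch of the machine-value proration described above:
# months in place divided by 12, capped at 1 full machine.
machine_value <- function(months_in_place) {
  pmin(months_in_place / 12, 1)
}
machine_value(c(2, 6, 12, 30))  # roughly 0.17, 0.5, 1, 1
```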
Methods and Results
The following table displays the variable names in this data set, along with their descriptions.
| Variable | Description |
|---|---|
| Cases | Number of cases sold in a store (response variable) |
| Location ID | Unique identifier of a store |
| Location City | City of the store |
| Location States | State of the store |
| Location Zip | ZIP code of the store |
| Machines | Number of machines at the store, prorated by time in place |
| Price_12oz | Price of the 12 oz product |
| Price_20oz | Price of the 20 oz product |
| Price_32oz | Price of the 32 oz product |
| Broken | If the machine is not working, how long has it been broken? |
| Rate | Rating of the cleanliness and experience in the store |
I start by applying basic summary and exploratory statistics to better understand the data and identify trends. I choose the variables most relevant to the purpose of the analysis: Cases, Machines, Price_12oz, Price_20oz, Price_32oz, Broken, and Rate. I will then use multiple linear regression for the analysis.
# load required packages
library(dplyr)
library(ggplot2)
library(ggfortify)  # autoplot() diagnostics for lm objects
library(corrplot)
library(car)        # avPlots()
library(bestglm)
library(glmnet)
# read the CSV file
mock <- read.csv("Mock Data Analysis-1.csv", header = TRUE, sep = ",") %>%
select("Cases", "Machines",
"Q3...1..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...12.oz.",
"Q3...2..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...20.oz.",
"Q3...3..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...32.oz.",
"Q4..If.the.machine.is.not.working..how.long.has.it.been.broken.",
"Q5..How.would.you.rate.the.cleanliness.and.experience.in.the.store.") %>%
rename(Price_12oz = "Q3...1..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...12.oz.",
Price_20oz = "Q3...2..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...20.oz.",
Price_32oz = "Q3...3..What.are.the.current.prices.for.a.12oz..20oz..and.32oz.cup.of.Frazil....Price...32.oz.",
Broken = "Q4..If.the.machine.is.not.working..how.long.has.it.been.broken.",
Rate = "Q5..How.would.you.rate.the.cleanliness.and.experience.in.the.store.")
mock$Cases <- as.double(mock$Cases)
mock$Price_12oz <- as.double(mock$Price_12oz)
mock$Price_20oz <- as.double(mock$Price_20oz)
mock$Price_32oz <- as.double(mock$Price_32oz)
mock$Machines <- as.double(mock$Machines)
mock$Machines <- round(mock$Machines)
mock <- filter(mock, !is.na(Cases), Cases > 0,
!(is.na(Price_12oz) & is.na(Price_20oz) & is.na(Price_32oz)),
Machines <= 3)
mock$Broken <- as.factor(mock$Broken)
mock$Rate <- as.factor(mock$Rate)
mock$Machines <- as.factor(mock$Machines)
### show the data set
summary(mock)
## Cases Machines Price_12oz Price_20oz Price_32oz
## Min. : 1.00 0: 53 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.: 10.00 1:1103 1st Qu.:1.290 1st Qu.:1.990 1st Qu.:2.690
## Median : 18.00 2: 320 Median :1.490 Median :1.990 Median :2.990
## Mean : 23.64 3: 40 Mean :1.554 Mean :2.138 Mean :2.987
## 3rd Qu.: 31.00 3rd Qu.:1.690 3rd Qu.:2.390 3rd Qu.:3.490
## Max. :153.00 Max. :3.590 Max. :3.990 Max. :9.590
## NA's :130 NA's :65 NA's :244
## Broken
## Less than 24 hours : 15
## More than 1 month : 32
## More than 1 week, less than 1 month : 25
## More than 24 hours, less than 1 week: 17
## Not Broken :1392
## Store clerk unsure : 35
##
## Rate
## 1 - Store is filthy, old appliances, etc. It is noticeably bad. : 49
## 2 - Store is average in these aspects. There is nothing out of the ordinary. :543
## 3 - Store is clean, modern, etc. You are impressed with over all look and feel.:924
##
##
##
##
head(mock)
## Cases Machines Price_12oz Price_20oz Price_32oz Broken
## 1 16 1 1.34 1.93 NA Not Broken
## 2 83 2 1.49 1.89 NA Not Broken
## 3 28 1 1.49 1.99 NA Not Broken
## 4 6 1 2.29 3.49 NA Not Broken
## 5 26 1 1.99 2.59 NA Not Broken
## 6 149 2 1.39 1.99 NA Not Broken
## Rate
## 1 2 - Store is average in these aspects. There is nothing out of the ordinary.
## 2 3 - Store is clean, modern, etc. You are impressed with over all look and feel.
## 3 3 - Store is clean, modern, etc. You are impressed with over all look and feel.
## 4 2 - Store is average in these aspects. There is nothing out of the ordinary.
## 5 3 - Store is clean, modern, etc. You are impressed with over all look and feel.
## 6 3 - Store is clean, modern, etc. You are impressed with over all look and feel.
# create data set only with continuous variable
mock_cont <- mock %>% select(Cases, Price_12oz : Price_32oz)
### scatterplot matrix (only with continuous)
pairs(mock_cont, pch = 5)
### correlation matrix (only used for continuous variables)
cor(mock_cont, use = "na.or.complete")
## Cases Price_12oz Price_20oz Price_32oz
## Cases 1.00000000 0.02748816 -0.01529701 -0.05556294
## Price_12oz 0.02748816 1.00000000 0.74622613 0.49811434
## Price_20oz -0.01529701 0.74622613 1.00000000 0.75745704
## Price_32oz -0.05556294 0.49811434 0.75745704 1.00000000
round(cor(mock_cont, use = "na.or.complete"), 2)
## Cases Price_12oz Price_20oz Price_32oz
## Cases 1.00 0.03 -0.02 -0.06
## Price_12oz 0.03 1.00 0.75 0.50
## Price_20oz -0.02 0.75 1.00 0.76
## Price_32oz -0.06 0.50 0.76 1.00
corrplot(cor(mock_cont, use = "na.or.complete"), type = "upper")
The correlation matrix does not show a strong correlation between any of the drink prices and Cases. Since the aim of the analysis is to see how well each of our machines is performing from a case sales perspective, I decide to remove the price variables because they do not have a linear relationship with Cases.
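As a quick sanity check on how weak these correlations are, `cor.test()` gives a formal test of whether a correlation differs from zero. The sketch below uses simulated data with a true correlation near the observed 0.03; in practice it would be run on the actual columns, e.g. `mock_cont$Cases` and `mock_cont$Price_12oz`.

```r
# Sketch: test whether a correlation this weak is distinguishable from zero.
# Simulated stand-in data; substitute the actual columns in practice.
set.seed(1)
x <- rnorm(1500)
y <- 0.03 * x + rnorm(1500)  # roughly the weak correlation observed above
ct <- cor.test(x, y)
ct$estimate  # a small sample correlation
ct$p.value   # typically not significant for an effect this small
```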
# re-read the CSV file, keeping only the remaining variables
mock <- read.csv("Mock Data Analysis-1.csv", header = TRUE, sep = ",") %>%
select("Cases", "Machines",
"Q4..If.the.machine.is.not.working..how.long.has.it.been.broken.",
"Q5..How.would.you.rate.the.cleanliness.and.experience.in.the.store.") %>%
rename(Broken = "Q4..If.the.machine.is.not.working..how.long.has.it.been.broken.",
Rate = "Q5..How.would.you.rate.the.cleanliness.and.experience.in.the.store.")
# make sure the response variable is numeric
mock$Cases <- as.double(mock$Cases)
## Warning: NAs introduced by coercion
# turn machine to whole number that represent number of machine for more than
# a year at the store
mock$Machines <- as.double(mock$Machines)
## Warning: NAs introduced by coercion
mock$Machines <- round(mock$Machines)
# fewer than 10 rows have Machines > 3; they are a very small share of
# the data
mock <- filter(mock, !is.na(Cases), Cases > 0, Machines <= 3)
# turn categorical variables to factors
mock$Broken <- as.factor(mock$Broken)
mock$Rate <- as.factor(mock$Rate)
mock$Machines <- as.factor(mock$Machines)
# change name of levels in Broken and Rate
levels(mock$Broken) <- list(BK = "Less than 24 hours", BK = "More than 1 month",
BK = "More than 1 week, less than 1 month",
BK = "More than 24 hours, less than 1 week",
NB = "Not Broken", NB = "Store clerk unsure", NB = "<NA>")
levels(mock$Rate) <- list(L1 = "1 - Store is filthy, old appliances, etc. It is noticeably bad.",
L2 = "2 - Store is average in these aspects. There is nothing out of the ordinary." ,
L3 = "3 - Store is clean, modern, etc. You are impressed with over all look and feel.")
# show the data set
summary(mock)
## Cases Machines Broken Rate
## Min. : 1.00 0: 55 BK : 114 L1 : 76
## 1st Qu.: 9.00 1:1304 NB :1631 L2 : 656
## Median : 17.00 2: 350 NA's: 7 L3 :1014
## Mean : 22.66 3: 43 NA's: 6
## 3rd Qu.: 30.00
## Max. :153.00
head(mock)
## Cases Machines Broken Rate
## 1 16 1 NB L2
## 2 4 1 BK L2
## 3 83 2 NB L3
## 4 28 1 NB L3
## 5 6 1 NB L2
## 6 33 2 NB L3
### Box Plot (for categorical)
# remove missing value
mock <- mock %>%
filter(!is.na(Rate), !is.na(Broken))
# create a box plot function
BoxPlotMock <- function(variable, name) {
ggplot(data = mock, mapping = aes(x = variable, y = Cases)) +
geom_boxplot() +
theme_bw() +
xlab(name) +
theme(aspect.ratio = 1)
}
# output box plot
BoxPlotMock(mock$Machines, "Machines")
BoxPlotMock(mock$Broken, "Broken")
BoxPlotMock(mock$Rate, "Rate")
### Interaction plot between Machines, Broken and Rate
ggplot(data = mock,
mapping = aes(x = Machines, y = Cases,
color = Broken, shape = Rate)) +
geom_point(size = 2) +
theme_bw() +
  theme(aspect.ratio = 1)
From our exploratory data analysis, I notice several interesting features. The box plots show that Machines clearly has a strong positive association with Cases, Rate may have some positive association, but Broken does not show a strong association with Cases. Additionally, I will consider including interaction terms, particularly an interaction between Rate and Machines.
I now want to fit a multiple linear regression model to the data set with Cases as the response and the remaining variables as predictors. Here is the general linear model I want to fit:
\(\text{Cases}_i = \beta_0 + \beta_1\times\text{Machines}_i + \beta_2\times\text{Broken}_i + \beta_3\times\text{Rate}_i + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\)
Variable Selection
I will start with variable selection to find the simplest model possible. I also suspect multicollinearity may be an issue among the predictors, so variable selection will help reduce this problem.
# create dummy variable
mock$machine1 <- ifelse(mock$Machines == "1", 1, 0)
mock$machine2 <- ifelse(mock$Machines == "2", 1, 0)
mock$machine3 <- ifelse(mock$Machines == "3", 1, 0)
mock$RateL2 <- ifelse(mock$Rate == "L2", 1, 0)
mock$RateL3 <- ifelse(mock$Rate == "L3", 1, 0)
mock$BrokenNB <- ifelse(mock$Broken == "NB", 1, 0)
mock$BrokenBK <- ifelse(mock$Broken == "BK", 1, 0)
# move the response variable to the last column;
# DON'T include the baseline dummy variables
mock_select <- mock %>% select(machine1 : machine3, RateL2, RateL3, BrokenBK, Cases)
# best subsets
head(mock_select)
## machine1 machine2 machine3 RateL2 RateL3 BrokenBK Cases
## 1 1 0 0 1 0 0 16
## 2 1 0 0 1 0 1 4
## 3 0 1 0 0 1 0 83
## 4 1 0 0 0 1 0 28
## 5 1 0 0 1 0 0 6
## 6 0 1 0 0 1 0 33
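Since I suspect multicollinearity among the predictors, variance inflation factors are one direct way to check. The sketch below demonstrates `car::vif()` on a small simulated example, since the report's data frame is not reproduced here; in practice it would be run on an `lm` fit to `mock_select`.

```r
# Sketch: VIF > 10 (some use > 5) is a common flag for multicollinearity.
library(car)
set.seed(42)
d <- data.frame(x1 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.1)  # nearly collinear with x1
d$y  <- d$x1 + rnorm(100)
vif(lm(y ~ x1 + x2, data = d))  # both VIFs come out very large here
```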
best_subset_aic <- bestglm(mock_select, IC = "BIC", method = "exhaustive")
best_subset_aic$BestModels
## machine1 machine2 machine3 RateL2 RateL3 BrokenBK Criterion
## 1 TRUE TRUE TRUE FALSE TRUE TRUE 9862.650
## 2 TRUE TRUE TRUE TRUE TRUE TRUE 9863.153
## 3 TRUE TRUE TRUE FALSE FALSE TRUE 9869.848
## 4 TRUE TRUE TRUE TRUE TRUE FALSE 9870.426
## 5 TRUE TRUE TRUE TRUE FALSE TRUE 9871.368
best_model_subset <- best_subset_aic$BestModel
summary(best_model_subset)
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.193 -9.934 -2.927 7.066 109.946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.988 2.321 1.718 0.085950 .
## machine1 11.946 2.308 5.177 2.52e-07 ***
## machine2 35.932 2.431 14.783 < 2e-16 ***
## machine3 51.071 3.411 14.972 < 2e-16 ***
## RateL3 3.133 0.818 3.831 0.000132 ***
## BrokenBK -7.007 1.627 -4.307 1.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.75 on 1738 degrees of freedom
## Multiple R-squared: 0.3282, Adjusted R-squared: 0.3263
## F-statistic: 169.8 on 5 and 1738 DF, p-value: < 2.2e-16
# # forward selection
# best_subset_aic_forward <- bestglm(mock_select, IC = "BIC", method = "forward")
# best_subset_aic_forward$BestModels
# best_model_subset_forward <- best_subset_aic_forward$BestModel
# summary(best_model_subset_forward)
# backward selection
best_subset_cv <- bestglm(mock_select, IC = "BIC", method = "backward", t = 100)
best_model_subset_cv <- best_subset_cv$BestModel
summary(best_model_subset_cv)
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.193 -9.934 -2.927 7.066 109.946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.988 2.321 1.718 0.085950 .
## machine1 11.946 2.308 5.177 2.52e-07 ***
## machine2 35.932 2.431 14.783 < 2e-16 ***
## machine3 51.071 3.411 14.972 < 2e-16 ***
## RateL3 3.133 0.818 3.831 0.000132 ***
## BrokenBK -7.007 1.627 -4.307 1.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.75 on 1738 degrees of freedom
## Multiple R-squared: 0.3282, Adjusted R-squared: 0.3263
## F-statistic: 169.8 on 5 and 1738 DF, p-value: < 2.2e-16
# step wise/sequential
best_subset_ss <- bestglm(mock_select, IC = "BIC", method = "seqrep", t = 100)
best_model_subset_ss <- best_subset_ss$BestModel
summary(best_model_subset_ss)
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.193 -9.934 -2.927 7.066 109.946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.988 2.321 1.718 0.085950 .
## machine1 11.946 2.308 5.177 2.52e-07 ***
## machine2 35.932 2.431 14.783 < 2e-16 ***
## machine3 51.071 3.411 14.972 < 2e-16 ***
## RateL3 3.133 0.818 3.831 0.000132 ***
## BrokenBK -7.007 1.627 -4.307 1.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.75 on 1738 degrees of freedom
## Multiple R-squared: 0.3282, Adjusted R-squared: 0.3263
## F-statistic: 169.8 on 5 and 1738 DF, p-value: < 2.2e-16
# LASSO
set.seed(12345) # make sure to set your seed when doing cross validation!
mock_select_x <- as.matrix(mock_select[, 1:6])
mock_select_y <- mock_select[, 7]
# use cross validation to pick the "best" (based on MSE) lambda
mock_select_ridge_LASSO <- cv.glmnet(x = mock_select_x,
y = mock_select_y,
type.measure = "mse",
alpha = 1) # 1 is code for "LASSO"
# plot (log) lambda vs MSE
# autoplot(mock_select_ridge_LASSO, label = FALSE) +
# theme_bw() +
# theme(aspect.ratio = 1)
# lambda.min: value of lambda that gives minimum mean cross-validated error
mock_select_ridge_LASSO$lambda.min
## [1] 0.01166017
# lambda.1se: value of lambda within 1 standard error of the minimum
# cross-validated error
mock_select_ridge_LASSO$lambda.1se
## [1] 1.772242
coef(mock_select_ridge_LASSO, s = "lambda.min")
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.2518514
## machine1 11.6050102
## machine2 35.5405534
## machine3 50.6646194
## RateL2 5.0627458
## RateL3 7.6703472
## BrokenBK -6.3145739
coef(mock_select_ridge_LASSO, s = "lambda.1se")
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 18.0134509
## machine1 .
## machine2 20.0455649
## machine3 27.1570977
## RateL2 .
## RateL3 .
## BrokenBK -0.1589097
# Elastic Net
set.seed(12345) # make sure to set your seed when doing cross validation!
# use cross validation to pick the "best" (based on MSE) lambda
mock_select_ridge_net <- cv.glmnet(x = mock_select_x,
y = mock_select_y,
type.measure = "mse",
alpha = 0.5)
# plot (log) lambda vs MSE
# autoplot(mock_select_ridge_net, label = FALSE) +
# theme_bw() +
# theme(aspect.ratio = 1)
# lambda.min: value of lambda that gives minimum mean cross-validated error
mock_select_ridge_net$lambda.min
## [1] 0.0160738
# lambda.1se: value of lambda within 1 standard error of the minimum
# cross-validated error
mock_select_ridge_net$lambda.1se
## [1] 2.681271
coef(mock_select_ridge_net, s = "lambda.min")
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.3000673
## machine1 11.5979015
## machine2 35.5277716
## machine3 50.6602989
## RateL2 5.1228120
## RateL3 7.7286651
## BrokenBK -6.3186443
coef(mock_select_ridge_net, s = "lambda.1se")
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 18.0315155
## machine1 -0.3431857
## machine2 19.4528304
## machine3 27.7585993
## RateL2 .
## RateL3 0.7611503
## BrokenBK -1.7066143
| Variable | Best Subset | Backward | Sequential Replacement | LASSO | Elastic Net |
|---|---|---|---|---|---|
| machine1 | X | X | X | | X |
| machine2 | X | X | X | X | X |
| machine3 | X | X | X | X | X |
| RateL2 | | | | | |
| RateL3 | X | X | X | | X |
| BrokenBK | X | X | X | X | X |
Given the results from all of the variable selection procedures, shown in the table above, I choose to keep machine1, machine2, machine3, RateL3, and BrokenBK in the final model, since four out of five methods suggest these variables are related to the response variable.
Initial Linear Model
# baseline levels: machine0, RateL1, BrokenNB
mock_lm <- lm(Cases ~ machine1 + machine2 + machine3 +
RateL3 + BrokenBK, data = mock_select)
summary(mock_lm)
##
## Call:
## lm(formula = Cases ~ machine1 + machine2 + machine3 + RateL3 +
## BrokenBK, data = mock_select)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.193 -9.934 -2.927 7.066 109.946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.988 2.321 1.718 0.085950 .
## machine1 11.946 2.308 5.177 2.52e-07 ***
## machine2 35.932 2.431 14.783 < 2e-16 ***
## machine3 51.071 3.411 14.972 < 2e-16 ***
## RateL3 3.133 0.818 3.831 0.000132 ***
## BrokenBK -7.007 1.627 -4.307 1.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.75 on 1738 degrees of freedom
## Multiple R-squared: 0.3282, Adjusted R-squared: 0.3263
## F-statistic: 169.8 on 5 and 1738 DF, p-value: < 2.2e-16
# anova test: interaction between broken and machine
mock_lm_inter_BkMach <- lm(Cases ~ machine1 + machine2 + machine3 +
RateL3 + BrokenBK + BrokenBK:machine1+ BrokenBK:machine2
+ BrokenBK:machine3
, data = mock_select)
anova(mock_lm, mock_lm_inter_BkMach)
## Analysis of Variance Table
##
## Model 1: Cases ~ machine1 + machine2 + machine3 + RateL3 + BrokenBK
## Model 2: Cases ~ machine1 + machine2 + machine3 + RateL3 + BrokenBK +
## BrokenBK:machine1 + BrokenBK:machine2 + BrokenBK:machine3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1738 487832
## 2 1735 486856 3 976.52 1.16 0.3237
The interaction between Broken and Machines is not significant (p = 0.324).
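As a check, the F statistic in the ANOVA table above can be reproduced by hand from the two printed residual sums of squares and their degrees of freedom:

```r
# Reproduce the ANOVA F statistic from the printed RSS and df values.
rss_reduced <- 487832; df_reduced <- 1738
rss_full    <- 486856; df_full    <- 1735
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) /
  (rss_full / df_full)
round(f_stat, 2)  # 1.16, matching the table
pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)  # p about 0.32
```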
# anova test: interaction between broken and rate
mock_lm_inter_BkRate <- lm(Cases ~ machine1 + machine2 + machine3 +
RateL3 + BrokenBK + BrokenBK:RateL3
, data = mock_select)
anova(mock_lm, mock_lm_inter_BkRate)
## Analysis of Variance Table
##
## Model 1: Cases ~ machine1 + machine2 + machine3 + RateL3 + BrokenBK
## Model 2: Cases ~ machine1 + machine2 + machine3 + RateL3 + BrokenBK +
## BrokenBK:RateL3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1738 487832
## 2 1737 487163 1 668.92 2.385 0.1227
The interaction between Broken and Rate is not significant (p = 0.123).
Based on these results, I fit a linear model without interaction terms.
# baseline levels: machine0, RateL1, BrokenNB
mock_lm_inter <- lm(Cases ~ machine1 + machine2 + machine3 +
RateL3 + BrokenBK,
data = mock_select)
summary(mock_lm_inter)
##
## Call:
## lm(formula = Cases ~ machine1 + machine2 + machine3 + RateL3 +
## BrokenBK, data = mock_select)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.193 -9.934 -2.927 7.066 109.946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.988 2.321 1.718 0.085950 .
## machine1 11.946 2.308 5.177 2.52e-07 ***
## machine2 35.932 2.431 14.783 < 2e-16 ***
## machine3 51.071 3.411 14.972 < 2e-16 ***
## RateL3 3.133 0.818 3.831 0.000132 ***
## BrokenBK -7.007 1.627 -4.307 1.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.75 on 1738 degrees of freedom
## Multiple R-squared: 0.3282, Adjusted R-squared: 0.3263
## F-statistic: 169.8 on 5 and 1738 DF, p-value: < 2.2e-16
Assumption Tests
# Linearity
# partial regression plots (avPlots() is base graphics from the car
# package, so ggplot2 theme() does not apply to it)
avPlots(mock_lm_inter)
# residual vs fitted
autoplot(mock_lm_inter, which = 1, ncol = 1, nrow = 1) + theme_bw() +
  theme(aspect.ratio = 1)
The partial regression plots show no unusual trend in the blue lines, and the residual-vs-fitted plot also has a fairly straight blue line, so the linearity assumption is probably met.
On the other hand, the residual-vs-fitted plot does not show equally spread residuals. The homoscedasticity assumption is not met.
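The visual impression of non-constant variance can be backed up with a formal test such as Breusch-Pagan, where a small p-value supports rejecting homoscedasticity; in this report it would be run as `lmtest::bptest(mock_lm_inter)`. The sketch below demonstrates it on simulated heteroscedastic data.

```r
# Sketch: Breusch-Pagan test for heteroscedasticity on simulated data.
library(lmtest)
set.seed(7)
x <- runif(500, 1, 10)
y <- 2 * x + rnorm(500, sd = x)  # error variance grows with x
bptest(lm(y ~ x))  # small p-value here, rejecting constant variance
```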
## Normality
# Boxplot
mock_select$residuals <- mock_lm_inter$residuals
ggplot(data = mock_select, mapping = aes(y = residuals)) + geom_boxplot() +
  theme_bw() + theme(aspect.ratio = 1)
# Histogram
(mock_select_hist <- ggplot(data = mock_select, mapping = aes(x = residuals)) +
# only has x being residuals
# when using this code for future data sets, make sure to change the binwidth:
geom_histogram(mapping = aes(y = ..density..), binwidth = 4) +
# stat_function() overlays the red normal curve on the histogram
stat_function(fun = dnorm,
color = "red",
size = 2,
args = list(mean = mean(mock_select$residuals),
sd = sd(mock_select$residuals))) +
  theme(aspect.ratio = 1))
# QQ Plot
autoplot(mock_lm_inter, which = 2, ncol = 1, nrow = 1) + theme_bw() +
  theme(aspect.ratio = 1)
# Shapiro-Wilk Test
shapiro.test(mock_lm_inter$residuals)
##
## Shapiro-Wilk normality test
##
## data: mock_lm_inter$residuals
## W = 0.91837, p-value < 2.2e-16
The Shapiro-Wilk test indicates the residuals are not normally distributed, but the test is very sensitive at this sample size. The box plot shows the residuals distributed fairly evenly, the histogram is approximately normal, and the QQ plot roughly follows the straight line. The normality assumption is probably met.
# Cook's Distance
mock_select$cooksd <- cooks.distance(mock_lm_inter)
# plot Cook's distance against the observation number
ggplot(data = mock_select) +
geom_point(mapping = aes(x = as.numeric(rownames(mock_select)),
y = cooksd)) +
theme_bw() +
ylab("Cook's Distance") +
xlab("Observation Number") +
geom_hline(mapping = aes(yintercept = 4 / length(cooksd)),
color = "red", linetype = "dashed") +
# scale_x_continuous(limits = c(0, 300)) +
scale_y_continuous(limits = c(0, 0.3)) +
  theme(aspect.ratio = 1)
# print a list of potential outliers according to Cook's distance
mock_select %>%
mutate(rowNum = row.names(mock_select)) %>% # save original row numbers
filter(cooksd > 4 / length(cooksd)) %>% # select potential outliers
  arrange(desc(cooksd)) # order from largest Cook's distance to smallest
## machine1 machine2 machine3 RateL2 RateL3 BrokenBK Cases residuals
## 1 0 0 1 1 0 0 151 95.94043
## 2 0 0 1 0 1 0 153 94.80708
## 3 0 0 1 1 0 0 117 61.94043
## 4 0 0 1 0 1 0 117 58.80708
## 5 0 0 1 0 1 0 8 -50.19292
## 6 0 0 1 0 1 0 108 49.80708
## 7 0 0 1 0 1 1 12 -39.18577
## 8 0 0 1 1 0 0 10 -45.05957
## 9 0 0 1 1 0 0 11 -44.05957
## 10 0 0 1 0 1 0 15 -43.19292
## 11 0 0 1 0 0 0 13 -42.05957
## 12 0 0 1 0 1 0 18 -40.19292
## 13 0 1 0 0 1 0 153 109.94591
## 14 0 1 0 0 1 0 149 105.94591
## 15 0 0 1 0 1 0 21 -37.19292
## 16 0 0 1 0 1 1 83 31.81423
## 17 0 0 1 0 1 0 95 36.80708
## 18 0 0 1 1 0 0 19 -36.05957
## 19 0 0 1 1 0 0 91 35.94043
## 20 0 0 1 0 0 0 20 -35.05957
## 21 0 1 0 1 0 0 123 83.07926
## 22 0 0 1 0 1 0 25 -33.19292
## 23 0 0 1 0 1 0 25 -33.19292
## 24 0 0 1 1 0 0 87 31.94043
## 25 0 1 0 1 0 0 117 77.07926
## 26 0 1 0 1 0 0 111 71.07926
## 27 0 1 0 0 1 0 117 73.94591
## 28 0 0 0 1 0 0 33 29.01174
## 29 0 1 0 1 0 0 104 64.07926
## 30 0 0 1 1 0 0 30 -25.05957
## 31 0 1 0 1 0 0 98 58.07926
## 32 1 0 0 1 0 1 46 37.07309
## 33 0 1 0 0 1 1 3 -33.04694
## 34 0 1 0 1 0 0 96 56.07926
## 35 0 1 0 0 0 1 64 31.08641
## 36 0 1 0 1 0 0 94 54.07926
## 37 0 1 0 0 1 0 102 58.94591
## 38 0 1 0 1 0 1 3 -29.91359
## 39 0 1 0 1 0 0 92 52.07926
## 40 0 0 1 0 1 0 79 20.80708
## 41 0 1 0 0 1 0 101 57.94591
## 42 0 1 0 0 1 0 98 54.94591
## 43 0 1 0 0 1 0 98 54.94591
## 44 0 1 0 0 0 1 59 26.08641
## 45 1 0 0 0 1 1 41 28.93974
## 46 0 1 0 0 1 0 93 49.94591
## 47 0 1 0 0 1 0 92 48.94591
## 48 0 1 0 0 1 1 11 -25.04694
## 49 0 0 1 1 0 0 38 -17.05957
## 50 1 0 0 1 0 0 84 68.06594
## 51 1 0 0 1 0 0 84 68.06594
## 52 1 0 0 0 1 0 95 75.93259
## 53 0 1 0 0 1 0 90 46.94591
## 54 0 1 0 0 1 1 12 -24.04694
## 55 0 1 0 1 0 1 56 23.08641
## 56 0 0 0 1 0 0 22 18.01174
## 57 0 1 0 1 0 0 80 40.07926
## 58 0 1 0 1 0 0 80 40.07926
## 59 0 1 0 0 1 0 88 44.94591
## 60 0 1 0 0 1 0 88 44.94591
## 61 0 1 0 0 1 1 13 -23.04694
## 62 0 0 1 0 1 0 74 15.80708
## 63 0 1 0 0 1 0 87 43.94591
## 64 0 1 0 0 1 0 87 43.94591
## 65 0 1 0 1 0 0 1 -38.92074
## 66 0 1 0 1 0 0 78 38.07926
## 67 0 1 0 1 0 0 78 38.07926
## 68 0 1 0 1 0 0 2 -37.92074
## 69 0 0 1 1 0 0 70 14.94043
## 70 0 1 0 0 1 0 1 -42.05409
## 71 1 0 0 1 0 0 75 59.06594
## 72 0 1 0 1 0 0 77 37.07926
## 73 0 1 0 1 0 0 3 -36.92074
## 74 0 1 0 1 0 0 3 -36.92074
## 75 0 1 0 0 1 0 2 -41.05409
## 76 0 1 0 0 1 0 2 -41.05409
## 77 0 1 0 0 1 0 84 40.94591
## 78 0 1 0 1 0 0 4 -35.92074
## 79 0 1 0 1 0 0 4 -35.92074
## 80 0 1 0 1 0 0 4 -35.92074
## 81 0 1 0 0 1 0 83 39.94591
## 82 0 1 0 1 0 0 75 35.07926
## 83 0 1 0 1 0 0 75 35.07926
## 84 0 1 0 0 1 0 4 -39.05409
## 85 0 1 0 0 1 0 4 -39.05409
## 86 0 1 0 0 1 0 82 38.94591
## 87 0 1 0 0 1 0 5 -38.05409
## 88 0 1 0 0 1 0 5 -38.05409
## 89 0 1 0 0 1 0 6 -37.05409
## 90 0 1 0 0 1 0 80 36.94591
## 91 0 0 1 0 1 0 45 -13.19292
## 92 1 0 0 1 0 1 30 21.07309
## 93 0 1 0 1 0 0 72 32.07926
## 94 0 1 0 1 0 0 8 -31.92074
## 95 0 1 0 1 0 0 8 -31.92074
## 96 0 0 0 1 0 0 18 14.01174
## 97 0 1 0 0 1 0 8 -35.05409
## 98 0 1 0 0 1 0 8 -35.05409
## 99 0 1 0 0 1 0 78 34.94591
## cooksd rowNum
## 1 0.139172797 1351
## 2 0.132630482 592
## 3 0.058009547 859
## 4 0.051029467 266
## 5 0.037174639 194
## 6 0.036605299 1050
## 7 0.030734176 455
## 8 0.030699060 948
## 9 0.029351581 97
## 10 0.027528780 361
## 11 0.026747343 1678
## 12 0.023837514 152
## 13 0.022801908 761
## 14 0.021172952 9
## 15 0.020411851 1120
## 16 0.020258519 1081
## 17 0.019990540 1165
## 18 0.019660387 847
## 19 0.019530690 1039
## 20 0.018585068 1677
## 21 0.016376985 1634
## 22 0.016257462 1109
## 23 0.016257462 1416
## 24 0.015425263 1658
## 25 0.014096905 1296
## 26 0.011987662 1411
## 27 0.010314337 1699
## 28 0.009973397 714
## 29 0.009742797 1598
## 30 0.009495071 885
## 31 0.008003701 1138
## 32 0.007812021 1736
## 33 0.007780446 108
## 34 0.007461966 622
## 35 0.006995339 1388
## 36 0.006939212 438
## 37 0.006554210 823
## 38 0.006477458 1694
## 39 0.006435441 122
## 40 0.006388282 1452
## 41 0.006333716 687
## 42 0.005694869 112
## 43 0.005694869 1422
## 44 0.004926021 1134
## 45 0.004923234 1163
## 46 0.004705577 1544
## 47 0.004519036 1498
## 48 0.004469422 1212
## 49 0.004400349 735
## 50 0.004341982 1025
## 51 0.004341982 1730
## 52 0.004341962 1579
## 53 0.004157273 314
## 54 0.004119663 1726
## 55 0.003858162 331
## 56 0.003844209 1721
## 57 0.003811431 724
## 58 0.003811431 920
## 59 0.003810600 770
## 60 0.003810600 1597
## 61 0.003784152 691
## 62 0.003686931 261
## 63 0.003642922 462
## 64 0.003642922 1143
## 65 0.003594272 137
## 66 0.003440533 723
## 67 0.003440533 1477
## 68 0.003411948 1568
## 69 0.003375027 727
## 70 0.003336027 1447
## 71 0.003269659 159
## 72 0.003262202 105
## 73 0.003234369 181
## 74 0.003234369 1470
## 75 0.003179260 603
## 76 0.003179260 651
## 77 0.003162525 593
## 78 0.003061535 1048
## 79 0.003061535 1378
## 80 0.003061535 1687
## 81 0.003009938 3
## 82 0.002919776 560
## 83 0.002919776 1118
## 84 0.002877042 1683
## 85 0.002877042 1708
## 86 0.002861124 170
## 87 0.002731592 655
## 88 0.002731592 1128
## 89 0.002589915 982
## 90 0.002574813 1319
## 91 0.002568288 969
## 92 0.002524075 1371
## 93 0.002441728 1311
## 94 0.002417657 359
## 95 0.002417657 454
## 96 0.002326375 791
## 97 0.002317878 475
## 98 0.002317878 536
## 99 0.002303593 1580
# calculate the DFFITS
mock_select$dffits <- dffits(mock_lm_inter)
# plot the DFFITS against the observation number
ggplot(data = mock_select) +
geom_point(mapping = aes(x = as.numeric(rownames(mock_select)),
y = abs(dffits))) +
theme_bw() +
ylab("Absolute Value of DFFITS for Y") +
xlab("Observation Number") +
# for n > 30
geom_hline(mapping = aes(yintercept = 2 * sqrt(length(mock_lm_inter$coefficients) /
length(dffits))),
color = "red", linetype = "dashed") +
# for n <= 30 (code for future, small data sets)
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") +
scale_y_continuous(limits = c(0, 0.8)) +
# scale_y_continuous(limits = c(0, 1.1)) +
  theme(aspect.ratio = 1)
## Warning: Removed 2 rows containing missing values (geom_point).
# print a list of potential influential points according to DFFITS
# for n > 30
mock_select %>%
mutate(rowNum = row.names(mock_select)) %>% # save original row numbers
# select potential influential pts
filter(abs(dffits) > 2 * sqrt(length(mock_lm_inter$coefficients) /
length(dffits))) %>%
  arrange(desc(abs(dffits))) # order from largest DFFITS to smallest
## machine1 machine2 machine3 RateL2 RateL3 BrokenBK Cases residuals
## 1 0 0 1 1 0 0 151 95.94043
## 2 0 0 1 0 1 0 153 94.80708
## 3 0 0 1 1 0 0 117 61.94043
## 4 0 0 1 0 1 0 117 58.80708
## 5 0 0 1 0 1 0 8 -50.19292
## 6 0 0 1 0 1 0 108 49.80708
## 7 0 0 1 0 1 1 12 -39.18577
## 8 0 0 1 1 0 0 10 -45.05957
## 9 0 0 1 1 0 0 11 -44.05957
## 10 0 0 1 0 1 0 15 -43.19292
## 11 0 0 1 0 0 0 13 -42.05957
## 12 0 0 1 0 1 0 18 -40.19292
## 13 0 1 0 0 1 0 153 109.94591
## 14 0 1 0 0 1 0 149 105.94591
## 15 0 0 1 0 1 0 21 -37.19292
## 16 0 0 1 0 1 1 83 31.81423
## 17 0 0 1 0 1 0 95 36.80708
## 18 0 0 1 1 0 0 19 -36.05957
## 19 0 0 1 1 0 0 91 35.94043
## 20 0 0 1 0 0 0 20 -35.05957
## 21 0 1 0 1 0 0 123 83.07926
## 22 0 0 1 0 1 0 25 -33.19292
## 23 0 0 1 0 1 0 25 -33.19292
## 24 0 0 1 1 0 0 87 31.94043
## 25 0 1 0 1 0 0 117 77.07926
## 26 0 1 0 1 0 0 111 71.07926
## 27 0 1 0 0 1 0 117 73.94591
## 28 0 0 0 1 0 0 33 29.01174
## 29 0 1 0 1 0 0 104 64.07926
## 30 0 0 1 1 0 0 30 -25.05957
## 31 0 1 0 1 0 0 98 58.07926
## 32 1 0 0 1 0 1 46 37.07309
## 33 0 1 0 0 1 1 3 -33.04694
## 34 0 1 0 1 0 0 96 56.07926
## 35 0 1 0 0 0 1 64 31.08641
## 36 0 1 0 1 0 0 94 54.07926
## 37 0 1 0 0 1 0 102 58.94591
## 38 0 1 0 1 0 1 3 -29.91359
## 39 0 1 0 1 0 0 92 52.07926
## 40 0 0 1 0 1 0 79 20.80708
## 41 0 1 0 0 1 0 101 57.94591
## 42 0 1 0 0 1 0 98 54.94591
## 43 0 1 0 0 1 0 98 54.94591
## 44 0 1 0 0 0 1 59 26.08641
## 45 1 0 0 0 1 1 41 28.93974
## 46 0 1 0 0 1 0 93 49.94591
## 47 0 1 0 0 1 0 92 48.94591
## 48 0 1 0 0 1 1 11 -25.04694
## 49 0 0 1 1 0 0 38 -17.05957
## 50 1 0 0 0 1 0 95 75.93259
## 51 1 0 0 1 0 0 84 68.06594
## 52 1 0 0 1 0 0 84 68.06594
## 53 0 1 0 0 1 0 90 46.94591
## 54 0 1 0 0 1 1 12 -24.04694
## 55 0 1 0 1 0 1 56 23.08641
## 56 0 0 0 1 0 0 22 18.01174
## 57 0 1 0 0 1 0 88 44.94591
## 58 0 1 0 0 1 0 88 44.94591
## 59 0 1 0 1 0 0 80 40.07926
## 60 0 1 0 1 0 0 80 40.07926
## 61 0 1 0 0 1 1 13 -23.04694
## 62 0 0 1 0 1 0 74 15.80708
## 63 0 1 0 0 1 0 87 43.94591
## 64 0 1 0 0 1 0 87 43.94591
## 65 0 1 0 1 0 0 1 -38.92074
## 66 0 1 0 1 0 0 78 38.07926
## 67 0 1 0 1 0 0 78 38.07926
## 68 0 1 0 1 0 0 2 -37.92074
## 69 0 0 1 1 0 0 70 14.94043
## 70 0 1 0 0 1 0 1 -42.05409
## 71 1 0 0 1 0 0 75 59.06594
## 72 0 1 0 1 0 0 77 37.07926
## 73 0 1 0 1 0 0 3 -36.92074
## 74 0 1 0 1 0 0 3 -36.92074
## 75 0 1 0 0 1 0 2 -41.05409
## 76 0 1 0 0 1 0 2 -41.05409
## 77 0 1 0 0 1 0 84 40.94591
## 78 0 1 0 1 0 0 4 -35.92074
## 79 0 1 0 1 0 0 4 -35.92074
## 80 0 1 0 1 0 0 4 -35.92074
## 81 0 1 0 0 1 0 83 39.94591
## 82 0 1 0 1 0 0 75 35.07926
## 83 0 1 0 1 0 0 75 35.07926
## 84 0 1 0 0 1 0 4 -39.05409
## 85 0 1 0 0 1 0 4 -39.05409
## 86 0 1 0 0 1 0 82 38.94591
## 87 0 1 0 0 1 0 5 -38.05409
## 88 0 1 0 0 1 0 5 -38.05409
## 89 0 1 0 0 1 0 6 -37.05409
## 90 0 1 0 0 1 0 80 36.94591
## 91 0 0 1 0 1 0 45 -13.19292
## 92 1 0 0 1 0 1 30 21.07309
## 93 0 1 0 1 0 0 72 32.07926
## 94 0 1 0 1 0 0 8 -31.92074
## 95 0 1 0 1 0 0 8 -31.92074
## 96 0 0 0 1 0 0 18 14.01174
## 97 0 1 0 0 1 0 8 -35.05409
## 98 0 1 0 0 1 0 8 -35.05409
## 99 0 1 0 0 1 0 78 34.94591
## 100 0 1 0 1 0 0 71 31.07926
## cooksd dffits rowNum
## 1 0.139172797 0.9225034 1351
## 2 0.132630482 0.9003462 592
## 3 0.058009547 0.5921854 859
## 4 0.051029467 0.5551926 266
## 5 0.037174639 -0.4733972 194
## 6 0.036605299 0.4697390 1050
## 7 0.030734176 -0.4299999 455
## 8 0.030699060 -0.4299732 948
## 9 0.029351581 -0.4203914 97
## 10 0.027528780 -0.4070956 361
## 11 0.026747343 -0.4012356 1678
## 12 0.023837514 -0.3787206 152
## 13 0.022801908 0.3744572 761
## 14 0.021172952 0.3605058 9
## 15 0.020411851 -0.3503672 1120
## 16 0.020258519 0.3489155 1081
## 17 0.019990540 0.3467220 1165
## 18 0.019660387 -0.3438275 847
## 19 0.019530690 0.3426885 1039
## 20 0.018585068 -0.3342675 1677
## 21 0.016376985 0.3156271 1634
## 22 0.016257462 -0.3125935 1109
## 23 0.016257462 -0.3125935 1416
## 24 0.015425263 0.3044618 1658
## 25 0.014096905 0.2925392 1296
## 26 0.011987662 0.2695179 1411
## 27 0.010314337 0.2501076 1699
## 28 0.009973397 0.2447679 714
## 29 0.009742797 0.2427366 1598
## 30 0.009495071 -0.2387737 885
## 31 0.008003701 0.2198411 1138
## 32 0.007812021 0.2167459 1736
## 33 0.007780446 -0.2162446 108
## 34 0.007461966 0.2122205 622
## 35 0.006995339 0.2050174 1388
## 36 0.006939212 0.2046052 438
## 37 0.006554210 0.1989610 823
## 38 0.006477458 -0.1972679 1694
## 39 0.006435441 0.1969950 122
## 40 0.006388282 0.1958124 1452
## 41 0.006333716 0.1955620 687
## 42 0.005694869 0.1853723 112
## 43 0.005694869 0.1853723 1422
## 44 0.004926021 0.1719909 1134
## 45 0.004923234 0.1719700 1163
## 46 0.004705577 0.1684123 1544
## 47 0.004519036 0.1650236 1498
## 48 0.004469422 -0.1638170 1212
## 49 0.004400349 -0.1624901 735
## 50 0.004341962 0.1623225 1579
## 51 0.004341982 0.1621325 1025
## 52 0.004341982 0.1621325 1730
## 53 0.004157273 0.1582491 314
## 54 0.004119663 -0.1572686 1726
## 55 0.003858162 0.1521882 331
## 56 0.003844209 0.1518803 1721
## 57 0.003810600 0.1514786 770
## 58 0.003810600 0.1514786 1597
## 59 0.003811431 0.1514306 724
## 60 0.003811431 0.1514306 920
## 61 0.003784152 -0.1507212 691
## 62 0.003686931 0.1487295 261
## 63 0.003642922 0.1480948 462
## 64 0.003642922 0.1480948 1143
## 65 0.003594272 -0.1470395 137
## 66 0.003440533 0.1438509 723
## 67 0.003440533 0.1438509 1477
## 68 0.003411948 -0.1432503 1568
## 69 0.003375027 0.1422955 727
## 70 0.003336027 -0.1416957 1447
## 71 0.003269659 0.1405280 159
## 72 0.003262202 0.1400623 105
## 73 0.003234369 -0.1394619 181
## 74 0.003234369 -0.1394619 1470
## 75 0.003179260 -0.1383145 603
## 76 0.003179260 -0.1383145 651
## 77 0.003162525 0.1379487 593
## 78 0.003061535 -0.1356743 1048
## 79 0.003061535 -0.1356743 1378
## 80 0.003061535 -0.1356743 1687
## 81 0.003009938 0.1345684 3
## 82 0.002919776 0.1324878 560
## 83 0.002919776 0.1324878 1118
## 84 0.002877042 -0.1315546 1683
## 85 0.002877042 -0.1315546 1708
## 86 0.002861124 0.1311890 170
## 87 0.002731592 -0.1281758 655
## 88 0.002731592 -0.1281758 1128
## 89 0.002589915 -0.1247979 982
## 90 0.002574813 0.1244325 1319
## 91 0.002568288 -0.1241229 969
## 92 0.002524075 0.1230840 1371
## 93 0.002441728 0.1211322 1311
## 94 0.002417657 -0.1205324 359
## 95 0.002417657 -0.1205324 454
## 96 0.002326375 0.1181353 791
## 97 0.002317878 -0.1180444 475
## 98 0.002317878 -0.1180444 536
## 99 0.002303593 0.1176791 1580
## 100 0.002291870 0.1173486 1702
Although the DFFITS plot does not show any points that lie far from the rest, the Cook's distance plot contains two points that may be outliers, so the assumption that the model describes all observations might not be met.
After fitting the linear regression model and checking its assumptions, I notice that several may not be met. Specifically, the residuals vs. fitted plot does not show equally distributed residuals, so the homoscedasticity assumption is violated, and the assumption that the model describes all observations may not hold either.
Trying Several Linear Models
Since the homoscedasticity assumption was likely not met, I apply the Box-Cox procedure to determine which transformation of the response to use.
# lambda close to 0 suggests a log transform
bc <- boxCox(mock_lm_inter_trans, family = "yjPower", plotit = TRUE)
bc$x[which.max(bc$y)]
## [1] 0.1818182
The Box-Cox procedure returns an optimal lambda of 0.1818182, which is close to 0 and therefore suggests a log transform of the response; applying it helped with the homoscedasticity assumption.
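The mapping from an estimated Box-Cox lambda to a named transformation can be sketched with a small helper; `pick_transform` is my own illustration, not part of the original analysis code.

```r
# Map a Box-Cox lambda to the nearest conventional transformation
# (thresholds are an illustrative rule of thumb, not a formal test).
pick_transform <- function(lambda) {
  if (abs(lambda) < 0.25) "log(y)"
  else if (abs(lambda - 0.5) < 0.25) "sqrt(y)"
  else if (abs(lambda - 1) < 0.25) "y (no transformation)"
  else sprintf("y^%.2f", lambda)
}
pick_transform(0.1818182)  # "log(y)"
```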
# create trans linear model
mock_select$Log_Cases <- log(mock_select$Cases)
mock_lm_inter_trans <- lm(Log_Cases ~ machine1 + machine2 + machine3 +
RateL3 + BrokenBK
, data = mock_select)
summary(mock_lm_inter_trans)
##
## Call:
## lm(formula = Log_Cases ~ machine1 + machine2 + machine3 + RateL3 +
## BrokenBK, data = mock_select)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5681 -0.4844 0.1144 0.5764 2.2109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.28561 0.11270 11.408 < 2e-16 ***
## machine1 1.16493 0.11203 10.399 < 2e-16 ***
## machine2 2.05135 0.11801 17.383 < 2e-16 ***
## machine3 2.39578 0.16560 14.467 < 2e-16 ***
## RateL3 0.23112 0.03971 5.820 7.00e-09 ***
## BrokenBK -0.35038 0.07899 -4.435 9.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8134 on 1738 degrees of freedom
## Multiple R-squared: 0.2637, Adjusted R-squared: 0.2616
## F-statistic: 124.5 on 5 and 1738 DF, p-value: < 2.2e-16
# create predicting value for log y
machine1_values <- mock_select$machine1
machine2_values <- mock_select$machine2
machine3_values <- mock_select$machine3
RateL3_values <- mock_select$RateL3
BrokenBK_values <- mock_select$BrokenBK
log_Cases_preds <- predict(mock_lm_inter_trans,
newdata = data.frame(machine1 = machine1_values,
machine2 = machine2_values,
machine3 = machine3_values,
RateL3 = RateL3_values,
BrokenBK = BrokenBK_values))
Cases_preds <- exp(log_Cases_preds) ## use exp to "UNDO" the log transform
# Store results in a data frame for plotting
preds <- data.frame("machine1_values" = machine1_values,
"machine2_values" = machine2_values,
"machine3_values" = machine3_values,
"RateL3" = RateL3_values,
"BrokenBK" = BrokenBK_values,
"log_Cases_preds" = log_Cases_preds)
Linearity, Homoscedasticity
# partial regression plots (avPlots uses base graphics, so no ggplot theme is added)
avPlots(mock_lm_inter_trans)
# residual vs fitted
autoplot(mock_lm_inter_trans, which = 1, ncol = 1, nrow = 1) + theme_bw() +
  theme(aspect.ratio = 1)
Both the partial regression plots and the residuals vs. fitted plot show no unusual curvature in the blue line, so the linearity assumption is probably met.
The residuals vs. fitted plot shows residuals spread evenly around a horizontal line with no distinct pattern, which indicates the homoscedasticity assumption is met.
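As a numeric complement to the visual check, one could run the score test for non-constant error variance from the car package (already used above for boxCox and avPlots). This is a sketch, not part of the original analysis.

```r
# Breusch-Pagan-style score test for non-constant error variance;
# a large p-value is consistent with the homoscedasticity assumption.
library(car)
ncvTest(mock_lm_inter_trans)
```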
# Boxplot
mock_select$residuals <- mock_lm_inter_trans$residuals
ggplot(data = mock_select, mapping = aes(y = residuals)) + geom_boxplot() +
theme_bw() + theme(aspect.ratio = 1)
# Histogram
(mock_select_hist <- ggplot(data = mock_select, mapping = aes(x = residuals)) +
# only has x being residuals
# when using this code for future data sets, make sure to change the binwidth:
geom_histogram(mapping = aes(y = ..density..)) +
# stat_function() overlays the red normal curve on the histogram
stat_function(fun = dnorm,
color = "red",
size = 2,
args = list(mean = mean(mock_select$residuals),
sd = sd(mock_select$residuals))) +
theme(aspect.ratio = 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# QQ Plot
autoplot(mock_lm_inter_trans, which = 2, ncol = 1, nrow = 1) + theme_bw() +
theme(aspect.ratio = 1)
# Shapiro-Wilk Test
shapiro.test(mock_lm_inter_trans$residuals)
##
## Shapiro-Wilk normality test
##
## data: mock_lm_inter_trans$residuals
## W = 0.96838, p-value < 2.2e-16
The Shapiro-Wilk test indicates the residuals are not normally distributed, but with roughly 1,700 observations the test is sensitive to even small departures from normality. The box plot shows a fairly even distribution, the histogram is approximately normal, and the QQ plot roughly follows the straight line, so the normality assumption is probably met.
# Cook's Distance
mock_select$cooksd <- cooks.distance(mock_lm_inter_trans)
# plot Cook's distance against the observation number
ggplot(data = mock_select) +
geom_point(mapping = aes(x = as.numeric(rownames(mock_select)),
y = cooksd)) +
theme_bw() +
ylab("Cook's Distance") +
xlab("Observation Number") +
geom_hline(mapping = aes(yintercept = 4 / length(cooksd)),
color = "red", linetype = "dashed") +
# scale_x_continuous(limits = c(0, 300)) +
scale_y_continuous(limits = c(0, 0.08)) +
theme(aspect.ratio = 1)
# print a list of potential outliers according to Cook's distance
mock_select %>%
mutate(rowNum = row.names(mock_select)) %>% # save original row numbers
filter(cooksd > 4 / length(cooksd)) %>% # select potential outliers
arrange(desc(cooksd)) # order from largest Cook's distance to smallest
## machine1 machine2 machine3 RateL2 RateL3 BrokenBK Cases residuals
## 1 0 0 0 1 0 0 33 2.2109014
## 2 0 0 1 0 1 0 8 -1.8330642
## 3 0 0 0 1 0 0 22 1.8054363
## 4 0 1 0 0 1 1 3 -2.1190840
## 5 1 0 0 0 1 1 1 -2.3312761
## 6 0 0 0 1 0 0 18 1.6047656
## 7 0 0 1 1 0 0 10 -1.3787998
## 8 0 0 1 1 0 0 151 1.3358950
## 9 0 1 0 1 0 0 1 -3.3369544
## 10 0 0 0 0 1 0 1 -1.5167270
## 11 0 0 0 0 1 0 1 -1.5167270
## 12 0 1 0 1 0 1 3 -1.8879631
## 13 0 0 1 1 0 0 11 -1.2834896
## 14 0 1 0 0 1 0 1 -3.5680753
## 15 0 0 1 0 1 1 12 -1.0772201
## 16 0 0 0 0 0 1 8 1.1442144
## 17 0 0 1 0 1 0 15 -1.2044556
## 18 0 0 0 1 0 0 1 -1.2856061
## 19 0 0 1 0 0 0 13 -1.1164355
## 20 0 0 1 0 1 0 153 1.1179321
## 21 0 0 1 1 0 0 117 1.0807891
## 22 1 0 0 1 0 1 46 1.7284862
## 23 0 1 0 1 0 0 2 -2.6438072
## 24 0 0 0 0 1 0 15 1.1913232
## 25 1 0 0 0 1 1 2 -1.6381289
## 26 1 0 0 0 1 1 2 -1.6381289
## 27 0 1 0 0 1 0 2 -2.8749281
## 28 0 1 0 0 1 0 2 -2.8749281
## 29 0 0 1 0 1 0 18 -1.0221340
## 30 0 0 1 0 1 1 83 0.8567138
## 31 0 0 0 1 0 1 6 0.8565324
## 32 0 0 0 1 0 0 10 1.0169790
## 33 0 0 0 1 0 0 10 1.0169790
## 34 0 1 0 1 0 0 3 -2.2383421
## 35 0 1 0 1 0 0 3 -2.2383421
## 36 1 0 0 1 0 1 2 -1.4070080
## 37 1 0 0 1 0 1 2 -1.4070080
## 38 1 0 0 0 1 1 41 1.3822960
## 39 0 0 1 0 1 0 21 -0.8679834
## 40 0 0 0 0 1 0 12 0.9681796
## 41 0 0 1 0 1 0 117 0.8496681
## 42 0 0 1 1 0 0 91 0.8294746
## 43 0 1 0 0 0 1 64 1.1723077
## 44 1 0 0 1 0 1 30 1.3010422
## 45 0 0 1 1 0 0 87 0.7845232
## 46 0 1 0 1 0 0 4 -1.9506600
## 47 0 1 0 1 0 0 4 -1.9506600
## 48 0 1 0 1 0 0 4 -1.9506600
## 49 0 1 0 0 1 0 4 -2.1817809
## 50 0 1 0 0 1 0 4 -2.1817809
## 51 1 0 0 0 1 1 3 -1.2326638
## 52 1 0 0 0 1 1 3 -1.2326638
## 53 0 0 1 0 1 0 108 0.7696254
## 54 0 1 0 0 0 1 59 1.0909621
## 55 0 0 1 1 0 0 19 -0.7369459
## 56 0 1 0 1 0 1 56 1.0387763
## 57 0 0 0 0 1 0 2 -0.8235799
## 58 0 0 0 0 1 0 2 -0.8235799
## 59 0 0 0 0 1 0 2 -0.8235799
## 60 0 0 0 0 1 0 2 -0.8235799
## 61 0 0 0 0 1 0 2 -0.8235799
## 62 0 0 0 0 1 0 2 -0.8235799
## 63 0 0 0 0 1 0 2 -0.8235799
## 64 0 0 0 0 1 0 2 -0.8235799
## 65 0 0 0 0 1 0 2 -0.8235799
## 66 0 0 0 0 1 0 2 -0.8235799
## 67 0 0 0 0 1 0 2 -0.8235799
## 68 1 0 0 1 0 1 26 1.1579413
## 69 0 0 0 1 0 0 8 0.7938354
## 70 0 1 0 0 1 0 5 -1.9586374
## 71 0 1 0 0 1 0 5 -1.9586374
## 72 0 0 1 0 0 0 20 -0.6856526
## 73 0 0 1 0 1 0 25 -0.6936300
## 74 0 0 1 0 1 0 25 -0.6936300
## 75 1 0 0 0 0 1 23 1.0353390
## 76 0 0 1 0 1 0 95 0.6413711
## 77 0 1 0 0 1 0 6 -1.7763158
## 78 1 0 0 1 0 1 3 -1.0015429
## 79 1 0 0 1 0 1 3 -1.0015429
## 80 1 0 0 1 0 0 1 -2.4505342
## 81 1 0 0 1 0 0 1 -2.4505342
## 82 1 0 0 1 0 0 1 -2.4505342
## 83 1 0 0 1 0 0 1 -2.4505342
## 84 1 0 0 0 1 1 27 0.9645608
## 85 1 0 0 0 1 0 1 -2.6816551
## 86 1 0 0 0 1 0 1 -2.6816551
## 87 1 0 0 0 1 0 1 -2.6816551
## 88 1 0 0 0 1 0 1 -2.6816551
## 89 1 0 0 0 1 0 1 -2.6816551
## cooksd dffits Log_Cases rowNum
## 1 0.024573483 0.24476793 3.4965076 714
## 2 0.021035348 -0.47339722 2.0794415 194
## 3 0.016386730 0.15188029 3.0910425 1721
## 4 0.013572843 -0.21624464 1.0986123 108
## 5 0.013554427 -0.06567523 0.0000000 179
## 6 0.012946464 0.11813529 2.8903718 791
## 7 0.012195116 -0.42997321 2.3025851 948
## 8 0.011447961 0.92250336 5.0172798 1351
## 9 0.011209384 -0.14703954 0.0000000 137
## 10 0.011137215 -0.05064049 0.0000000 467
## 11 0.011137215 -0.05064049 0.0000000 1541
## 12 0.010946797 -0.19726791 1.0986123 1694
## 13 0.010567402 -0.42039136 2.3978953 97
## 14 0.010188578 -0.14169568 0.0000000 1447
## 15 0.009853857 -0.42999994 2.4849066 455
## 16 0.009480895 0.11149240 2.0794415 1395
## 17 0.009081883 -0.40709558 2.7080502 361
## 18 0.008308906 -0.02518954 0.0000000 211
## 19 0.007995596 -0.40123561 2.5649494 1678
## 20 0.007823934 0.90034618 5.0304379 592
## 21 0.007493167 0.59218539 4.7621739 859
## 22 0.007204603 0.21674586 3.8286414 1736
## 23 0.007036242 -0.14325026 0.6931472 1568
## 24 0.006871015 0.06517498 2.7080502 1271
## 25 0.006692523 -0.05973597 0.6931472 126
## 26 0.006692523 -0.05973597 0.6931472 522
## 27 0.006614538 -0.13831445 0.6931472 603
## 28 0.006614538 -0.13831445 0.6931472 651
## 29 0.006540486 -0.37872059 2.8903718 152
## 30 0.006232595 0.34891548 4.4188406 1081
## 31 0.005312783 0.09125197 1.7917595 1274
## 32 0.005199384 0.05067750 2.3025851 488
## 33 0.005199384 0.05067750 2.3025851 789
## 34 0.005043525 -0.13946186 1.0986123 181
## 35 0.005043525 -0.13946186 1.0986123 1470
## 36 0.004773879 -0.04044217 0.6931472 1058
## 37 0.004773879 -0.04044217 0.6931472 1127
## 38 0.004765361 0.17196999 3.7135721 1163
## 39 0.004716470 -0.35036721 3.0445224 1120
## 40 0.004538095 0.04035549 2.4849066 805
## 41 0.004519527 0.55519255 4.7621739 266
## 42 0.004413567 0.34268848 4.5108595 1039
## 43 0.004220689 0.20501744 4.1588831 1388
## 44 0.004081887 0.12308395 3.4011974 1371
## 45 0.003948164 0.30446182 4.4659081 1658
## 46 0.003830402 -0.13567433 1.3862944 1048
## 47 0.003830402 -0.13567433 1.3862944 1378
## 48 0.003830402 -0.13567433 1.3862944 1687
## 49 0.003809499 -0.13155455 1.3862944 1683
## 50 0.003809499 -0.13155455 1.3862944 1708
## 51 0.003789510 -0.05379709 1.0986123 75
## 52 0.003789510 -0.05379709 1.0986123 520
## 53 0.003708115 0.46973900 4.6821312 1050
## 54 0.003655270 0.17199093 4.0775374 1134
## 55 0.003483812 -0.34382755 2.9444390 847
## 56 0.003313937 0.15218824 4.0253517 331
## 57 0.003283773 -0.04236757 0.6931472 469
## 58 0.003283773 -0.04236757 0.6931472 550
## 59 0.003283773 -0.04236757 0.6931472 1278
## 60 0.003283773 -0.04236757 0.6931472 1279
## 61 0.003283773 -0.04236757 0.6931472 1282
## 62 0.003283773 -0.04236757 0.6931472 1286
## 63 0.003283773 -0.04236757 0.6931472 1314
## 64 0.003283773 -0.04236757 0.6931472 1472
## 65 0.003283773 -0.04236757 0.6931472 1478
## 66 0.003283773 -0.04236757 0.6931472 1615
## 67 0.003283773 -0.04236757 0.6931472 1616
## 68 0.003233340 0.09970495 3.2580965 1417
## 69 0.003168027 0.03381729 2.0794415 1275
## 70 0.003070108 -0.12817584 1.6094379 655
## 71 0.003070108 -0.12817584 1.6094379 1128
## 72 0.003015725 -0.33426752 2.9957323 1677
## 73 0.003011966 -0.31259349 3.2188758 1109
## 74 0.003011966 -0.31259349 3.2188758 1416
## 75 0.002584898 0.08217734 3.1354942 1010
## 76 0.002575213 0.34672205 4.5538769 1165
## 77 0.002525143 -0.12479792 1.7917595 982
## 78 0.002418897 -0.03460329 1.0986123 68
## 79 0.002418897 -0.03460329 1.0986123 383
## 80 0.002387714 -0.03541132 0.0000000 203
## 81 0.002387714 -0.03541132 0.0000000 343
## 82 0.002387714 -0.03541132 0.0000000 377
## 83 0.002387714 -0.03541132 0.0000000 381
## 84 0.002320347 0.08872064 3.2958369 1131
## 85 0.002297566 -0.03840670 0.0000000 320
## 86 0.002297566 -0.03840670 0.0000000 333
## 87 0.002297566 -0.03840670 0.0000000 399
## 88 0.002297566 -0.03840670 0.0000000 1334
## 89 0.002297566 -0.03840670 0.0000000 1474
# calculate the DFFITS
mock_select$dffits <- dffits(mock_lm_inter_trans)
# plot the DFFITS against the observation number
ggplot(data = mock_select) +
geom_point(mapping = aes(x = as.numeric(rownames(mock_select)),
y = abs(dffits))) +
theme_bw() +
ylab("Absolute Value of DFFITS for Y") +
xlab("Observation Number") +
# for n > 30
geom_hline(mapping = aes(yintercept = 2 * sqrt(length(mock_lm_inter_trans$coefficients) /
length(dffits))),
color = "red", linetype = "dashed") +
# for n <= 30 (code for future, small data sets)
# geom_hline(mapping = aes(yintercept = 1),
# color = "red", linetype = "dashed") +
scale_y_continuous(limits = c(0, 0.8)) +
# scale_y_continuous(limits = c(0, 1.1)) +
theme(aspect.ratio = 1)
# print a list of potential influential points according to DFFITS
# for n > 30
mock_select %>%
mutate(rowNum = row.names(mock_select)) %>% # save original row numbers
# select potential influential pts
filter(abs(dffits) > 2 * sqrt(length(mock_lm_inter_trans$coefficients) /
length(dffits))) %>%
arrange(desc(abs(dffits))) # order from largest DFFITS to smallest
## machine1 machine2 machine3 RateL2 RateL3 BrokenBK Cases residuals
## 1 0 0 0 1 0 0 33 2.2109014
## 2 0 0 1 0 1 0 8 -1.8330642
## 3 0 0 0 1 0 0 22 1.8054363
## 4 0 1 0 0 1 1 3 -2.1190840
## 5 1 0 0 0 1 1 1 -2.3312761
## 6 0 0 0 1 0 0 18 1.6047656
## 7 0 0 1 1 0 0 10 -1.3787998
## 8 0 0 1 1 0 0 151 1.3358950
## 9 0 1 0 1 0 0 1 -3.3369544
## 10 0 0 0 0 1 0 1 -1.5167270
## 11 0 0 0 0 1 0 1 -1.5167270
## 12 0 1 0 1 0 1 3 -1.8879631
## 13 0 0 1 1 0 0 11 -1.2834896
## 14 0 1 0 0 1 0 1 -3.5680753
## 15 0 0 1 0 1 1 12 -1.0772201
## 16 0 0 0 0 0 1 8 1.1442144
## 17 0 0 1 0 1 0 15 -1.2044556
## 18 0 0 0 1 0 0 1 -1.2856061
## 19 0 0 1 0 0 0 13 -1.1164355
## 20 0 0 1 0 1 0 153 1.1179321
## 21 0 0 1 1 0 0 117 1.0807891
## 22 1 0 0 1 0 1 46 1.7284862
## 23 0 1 0 1 0 0 2 -2.6438072
## 24 0 0 0 0 1 0 15 1.1913232
## 25 1 0 0 0 1 1 2 -1.6381289
## 26 1 0 0 0 1 1 2 -1.6381289
## 27 0 1 0 0 1 0 2 -2.8749281
## 28 0 1 0 0 1 0 2 -2.8749281
## 29 0 0 1 0 1 0 18 -1.0221340
## 30 0 0 1 0 1 1 83 0.8567138
## 31 0 0 0 1 0 1 6 0.8565324
## 32 0 0 0 1 0 0 10 1.0169790
## 33 0 0 0 1 0 0 10 1.0169790
## 34 0 1 0 1 0 0 3 -2.2383421
## 35 0 1 0 1 0 0 3 -2.2383421
## 36 1 0 0 1 0 1 2 -1.4070080
## 37 1 0 0 1 0 1 2 -1.4070080
## 38 1 0 0 0 1 1 41 1.3822960
## 39 0 0 1 0 1 0 21 -0.8679834
## 40 0 0 0 0 1 0 12 0.9681796
## 41 0 0 1 0 1 0 117 0.8496681
## 42 0 0 1 1 0 0 91 0.8294746
## 43 0 1 0 0 0 1 64 1.1723077
## 44 1 0 0 1 0 1 30 1.3010422
## 45 0 0 1 1 0 0 87 0.7845232
## 46 0 1 0 1 0 0 4 -1.9506600
## 47 0 1 0 1 0 0 4 -1.9506600
## 48 0 1 0 1 0 0 4 -1.9506600
## 49 0 1 0 0 1 0 4 -2.1817809
## 50 0 1 0 0 1 0 4 -2.1817809
## 51 1 0 0 0 1 1 3 -1.2326638
## 52 1 0 0 0 1 1 3 -1.2326638
## 53 0 0 1 0 1 0 108 0.7696254
## 54 0 1 0 0 0 1 59 1.0909621
## 55 0 0 1 1 0 0 19 -0.7369459
## 56 0 1 0 1 0 1 56 1.0387763
## 57 0 0 0 0 1 0 2 -0.8235799
## 58 0 0 0 0 1 0 2 -0.8235799
## 59 0 0 0 0 1 0 2 -0.8235799
## 60 0 0 0 0 1 0 2 -0.8235799
## 61 0 0 0 0 1 0 2 -0.8235799
## 62 0 0 0 0 1 0 2 -0.8235799
## 63 0 0 0 0 1 0 2 -0.8235799
## 64 0 0 0 0 1 0 2 -0.8235799
## 65 0 0 0 0 1 0 2 -0.8235799
## 66 0 0 0 0 1 0 2 -0.8235799
## 67 0 0 0 0 1 0 2 -0.8235799
## 68 1 0 0 1 0 1 26 1.1579413
## 69 0 0 0 1 0 0 8 0.7938354
## 70 0 1 0 0 1 0 5 -1.9586374
## 71 0 1 0 0 1 0 5 -1.9586374
## 72 0 0 1 0 0 0 20 -0.6856526
## 73 0 0 1 0 1 0 25 -0.6936300
## 74 0 0 1 0 1 0 25 -0.6936300
## 75 1 0 0 0 0 1 23 1.0353390
## 76 0 0 1 0 1 0 95 0.6413711
## 77 0 1 0 0 1 0 6 -1.7763158
## 78 1 0 0 1 0 1 3 -1.0015429
## 79 1 0 0 1 0 1 3 -1.0015429
## 80 1 0 0 1 0 0 1 -2.4505342
## 81 1 0 0 1 0 0 1 -2.4505342
## 82 1 0 0 1 0 0 1 -2.4505342
## 83 1 0 0 1 0 0 1 -2.4505342
## 84 1 0 0 0 1 1 27 0.9645608
## 85 1 0 0 0 1 0 1 -2.6816551
## 86 1 0 0 0 1 0 1 -2.6816551
## 87 1 0 0 0 1 0 1 -2.6816551
## 88 1 0 0 0 1 0 1 -2.6816551
## 89 1 0 0 0 1 0 1 -2.6816551
## cooksd dffits Log_Cases rowNum
## 1 0.024573483 0.3847045 3.4965076 714
## 2 0.021035348 -0.3556939 2.0794415 194
## 3 0.016386730 0.3139246 3.0910425 1721
## 4 0.013572843 -0.2858550 1.0986123 108
## 5 0.013554427 -0.2857789 0.0000000 179
## 6 0.012946464 0.2789476 2.8903718 791
## 7 0.012195116 -0.2706524 2.3025851 948
## 8 0.011447961 0.2622167 5.0172798 1351
## 9 0.011209384 -0.2605332 0.0000000 137
## 10 0.011137215 -0.2586914 0.0000000 467
## 11 0.011137215 -0.2586914 0.0000000 1541
## 12 0.010946797 -0.2566117 1.0986123 1694
## 13 0.010567402 -0.2519149 2.3978953 97
## 14 0.010188578 -0.2485609 0.0000000 1447
## 15 0.009853857 -0.2432093 2.4849066 455
## 16 0.009480895 0.2385776 2.0794415 1395
## 17 0.009081883 -0.2335175 2.7080502 361
## 18 0.008308906 -0.2233783 0.0000000 211
## 19 0.007995596 -0.2190874 2.5649494 1678
## 20 0.007823934 0.2167231 5.0304379 592
## 21 0.007493167 0.2120848 4.7621739 859
## 22 0.007204603 0.2081258 3.8286414 1736
## 23 0.007036242 -0.2060394 0.6931472 1568
## 24 0.006871015 0.2031114 2.7080502 1271
## 25 0.006692523 -0.2005663 0.6931472 126
## 26 0.006692523 -0.2005663 0.6931472 522
## 27 0.006614538 -0.1998812 0.6931472 603
## 28 0.006614538 -0.1998812 0.6931472 651
## 29 0.006540486 -0.1981335 2.8903718 152
## 30 0.006232595 0.1933874 4.4188406 1081
## 31 0.005312783 0.1785477 1.7917595 1274
## 32 0.005199384 0.1766550 2.3025851 488
## 33 0.005199384 0.1766550 2.3025851 789
## 34 0.005043525 -0.1742889 1.0986123 181
## 35 0.005043525 -0.1742889 1.0986123 1470
## 36 0.004773879 -0.1693418 0.6931472 1058
## 37 0.004773879 -0.1693418 0.6931472 1127
## 38 0.004765361 0.1691855 3.7135721 1163
## 39 0.004716470 -0.1682306 3.0445224 1120
## 40 0.004538095 0.1650319 2.4849066 805
## 41 0.004519527 0.1646785 4.7621739 266
## 42 0.004413567 0.1627342 4.5108595 1039
## 43 0.004220689 0.1591861 4.1588831 1388
## 44 0.004081887 0.1565684 3.4011974 1371
## 45 0.003948164 0.1539102 4.4659081 1658
## 46 0.003830402 -0.1518083 1.3862944 1048
## 47 0.003830402 -0.1518083 1.3862944 1378
## 48 0.003830402 -0.1518083 1.3862944 1687
## 49 0.003809499 -0.1514566 1.3862944 1683
## 50 0.003809499 -0.1514566 1.3862944 1708
## 51 0.003789510 -0.1508454 1.0986123 75
## 52 0.003789510 -0.1508454 1.0986123 520
## 53 0.003708115 0.1491564 4.6821312 1050
## 54 0.003655270 0.1481283 4.0775374 1134
## 55 0.003483812 -0.1445717 2.9444390 847
## 56 0.003313937 0.1410357 4.0253517 331
## 57 0.003283773 -0.1403679 0.6931472 469
## 58 0.003283773 -0.1403679 0.6931472 550
## 59 0.003283773 -0.1403679 0.6931472 1278
## 60 0.003283773 -0.1403679 0.6931472 1279
## 61 0.003283773 -0.1403679 0.6931472 1282
## 62 0.003283773 -0.1403679 0.6931472 1286
## 63 0.003283773 -0.1403679 0.6931472 1314
## 64 0.003283773 -0.1403679 0.6931472 1472
## 65 0.003283773 -0.1403679 0.6931472 1478
## 66 0.003283773 -0.1403679 0.6931472 1615
## 67 0.003283773 -0.1403679 0.6931472 1616
## 68 0.003233340 0.1393260 3.2580965 1417
## 69 0.003168027 0.1378689 2.0794415 1275
## 70 0.003070108 -0.1359113 1.6094379 655
## 71 0.003070108 -0.1359113 1.6094379 1128
## 72 0.003015725 -0.1345047 2.9957323 1677
## 73 0.003011966 -0.1344215 3.2188758 1109
## 74 0.003011966 -0.1344215 3.2188758 1416
## 75 0.002584898 0.1245595 3.1354942 1010
## 76 0.002575213 0.1242902 4.5538769 1165
## 77 0.002525143 -0.1232231 1.7917595 982
## 78 0.002418897 -0.1204899 1.0986123 68
## 79 0.002418897 -0.1204899 1.0986123 383
## 80 0.002387714 -0.1199722 0.0000000 203
## 81 0.002387714 -0.1199722 0.0000000 343
## 82 0.002387714 -0.1199722 0.0000000 377
## 83 0.002387714 -0.1199722 0.0000000 381
## 84 0.002320347 0.1180061 3.2958369 1131
## 85 0.002297566 -0.1177467 0.0000000 320
## 86 0.002297566 -0.1177467 0.0000000 333
## 87 0.002297566 -0.1177467 0.0000000 399
## 88 0.002297566 -0.1177467 0.0000000 1334
## 89 0.002297566 -0.1177467 0.0000000 1474
No points are extremely far from the rest of the data in either the Cook's distance plot or the DFFITS plot. Therefore, the assumption that the model describes all observations is probably met.
The independence assumption should be met because the data are collected from all stores, and additional predictor variables are not required for the purpose of this analysis. However, more detailed machine-level data, such as machine temperature, might allow more precise predictions.
Although the selection method and the ANOVA test do not suggest strong interactions between variables, I decided to examine an interaction plot between Broken and Rate, since the EDA results suggested there might be a significant interaction between those two variables, which could be useful for future reference.
interaction.plot(x.factor = mock$Broken,
trace.factor = mock$Rate,
response = mock$Cases,
col = c("#1b9e77", "#d95f02"),
lwd = 2,
trace.label = "Rate",
ylab = "Cases",
xlab = "Broken")
The interaction plot shows that Broken has a strong interaction with Rate, particularly for Rate 1. This would be a good target for a more suitable data set, to see how these two factors jointly affect sales.
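One way to check the interaction formally, sketched here rather than taken from the original selection procedure, is to refit the log-scale model with a Rate-by-Broken interaction term and compare the nested models with a partial F-test.

```r
# Nested-model comparison: does adding the RateL3:BrokenBK interaction
# significantly reduce the residual sum of squares?
mock_lm_with_int <- lm(Log_Cases ~ machine1 + machine2 + machine3 +
                         RateL3 * BrokenBK, data = mock_select)
anova(mock_lm_inter_trans, mock_lm_with_int)
```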
Final Linear Model
mock_lm_inter_trans <- lm(Log_Cases ~ machine1 + machine2 + machine3 +
RateL3 + BrokenBK
, data = mock_select)
summary(mock_lm_inter_trans)
##
## Call:
## lm(formula = Log_Cases ~ machine1 + machine2 + machine3 + RateL3 +
## BrokenBK, data = mock_select)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5681 -0.4844 0.1144 0.5764 2.2109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.28561 0.11270 11.408 < 2e-16 ***
## machine1 1.16493 0.11203 10.399 < 2e-16 ***
## machine2 2.05135 0.11801 17.383 < 2e-16 ***
## machine3 2.39578 0.16560 14.467 < 2e-16 ***
## RateL3 0.23112 0.03971 5.820 7.00e-09 ***
## BrokenBK -0.35038 0.07899 -4.435 9.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8134 on 1738 degrees of freedom
## Multiple R-squared: 0.2637, Adjusted R-squared: 0.2616
## F-statistic: 124.5 on 5 and 1738 DF, p-value: < 2.2e-16
Our final model seems to meet all of the assumptions of linear regression.
Our final fitted model is:
\(\log(\widehat{\text{Cases}}_i)\) \(= 1.28561 +\) \(1.16493\times\text{machine1}_i\) + \(2.05135\times\text{machine2}_i\) + \(2.39578\times\text{machine3}_i\) + \(0.23112\times\text{RateL3}_i -0.35038 \times\text{BrokenBK}_i\)
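Because the response is modeled on the log scale, a fitted value must be exponentiated before it can be read in cases. As a sketch using the coefficients above (for a hypothetical store I chose for illustration), noting that exponentiating a log-scale prediction estimates the median, not the mean, of Cases:

```r
# Hypothetical store: one machine present for more than a year
# (machine1 = 1), rated clean and modern (RateL3 = 1), and no broken
# machine (BrokenBK = 0). Coefficients copied from the summary above.
log_pred <- 1.28561 + 1.16493 * 1 + 2.05135 * 0 + 2.39578 * 0 +
  0.23112 * 1 - 0.35038 * 0
exp(log_pred)  # about 14.6 cases (predicted median)
```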
Model Assessment
Now that I have a model that describes the data well with the assumptions met, I would like to use it to make inferences and predictions. I am interested in creating confidence intervals for the selected variables, as well as generating predictions for new stores. I am particularly interested in the relationship between predicted average cases and machine status.
summary(mock_lm_inter_trans)
##
## Call:
## lm(formula = Log_Cases ~ machine1 + machine2 + machine3 + RateL3 +
## BrokenBK, data = mock_select)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5681 -0.4844 0.1144 0.5764 2.2109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.28561 0.11270 11.408 < 2e-16 ***
## machine1 1.16493 0.11203 10.399 < 2e-16 ***
## machine2 2.05135 0.11801 17.383 < 2e-16 ***
## machine3 2.39578 0.16560 14.467 < 2e-16 ***
## RateL3 0.23112 0.03971 5.820 7.00e-09 ***
## BrokenBK -0.35038 0.07899 -4.435 9.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8134 on 1738 degrees of freedom
## Multiple R-squared: 0.2637, Adjusted R-squared: 0.2616
## F-statistic: 124.5 on 5 and 1738 DF, p-value: < 2.2e-16
confint(mock_lm_inter_trans, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 1.0645715 1.5066407
## machine1 0.9452088 1.3846474
## machine2 1.8198975 2.2827990
## machine3 2.0709728 2.7205847
## RateL3 0.1532300 0.3090118
## BrokenBK -0.5053131 -0.1954449
The F-test and the p-values for each variable all suggest that the predictors have a significant relationship with case sales. Generally, average case sales increase when more machines have been at a store for more than one year, and a clean, modern store also has higher average case sales. On the other hand, a broken machine decreases average sales. The baseline for the model is a store with zero machines present for more than one year and no broken machine.
The confidence intervals for the variables are informative, keeping in mind that the coefficients are on the log scale and must be exponentiated before being read as multiplicative effects on case sales. For example, I am 95% confident that, holding all else constant, a store with three machines present for more than one year sells between about exp(2.07) ≈ 8 and exp(2.72) ≈ 15 times as many cases, on average, as a store with zero such machines, and a store with one such machine sells between about exp(0.95) ≈ 2.6 and exp(1.38) ≈ 4 times as many. This suggests that increasing the number of machines at a store has a strong positive influence on case sales. On the other hand, holding all else constant, I am 95% confident that a broken machine lowers average sales by roughly 18% to 40% compared with no broken machine, which indicates that maintaining functional machines is crucial for higher case sales. The RateL3 coefficient is also significant: I am 95% confident that a clean, modern store has average case sales roughly 17% to 36% higher, holding all else constant, so a customer-friendly store has a positive influence on case sales.
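Since the coefficients are on the log scale, exponentiating the interval bounds gives the multiplicative effect on case sales directly; a short sketch of that conversion:

```r
# Exponentiate the log-scale confidence bounds to get multiplicative
# effects on Cases, then express them as percent changes.
ci <- confint(mock_lm_inter_trans, level = 0.95)
round((exp(ci) - 1) * 100, 1)
```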
I am also interested in how well the model fits the data. To assess this, I look at metrics such as the RMSE, which are important to check and understand because they show how well the model fits.
# MSE, RMSE
anova <- aov(mock_lm_inter_trans) # get ANOVA components
mock_lm_inter_trans_anova <- summary(anova)[[1]] # save data in a usable form
mse = mock_lm_inter_trans_anova["Residuals", "Sum Sq"] /
mock_lm_inter_trans_anova["Residuals", "Df"]
(sqrt(mse))
## [1] 0.8133812
The RMSE represents the typical spread of the residuals. Here it indicates an average error of about 0.81, on the log(Cases) scale, since the model was fit to the transformed response. Given that the mean of log(Cases) is 2.987, an error of this size is a potential concern when interpreting results from the analysis.
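Because the 0.81 RMSE is on the log scale, it may also be useful to report an error on the original case scale; a rough sketch, assuming mock_select and the fitted model are still in the workspace:

```r
# Back-transform the fitted values and compute an RMSE in cases.
# (exp() of a log-scale fit estimates the median, so this is a rough
# measure rather than an exact mean-scale error.)
cases_fit <- exp(fitted(mock_lm_inter_trans))
sqrt(mean((mock_select$Cases - cases_fit)^2))
```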
Summary and Conclusions
Our business model depends on maintaining a large number of Frazil-owned machines. Therefore, understanding the relationship between machine utilization and case sales is critical to the success of our business, and we want to quantify that relationship in order to predict future sales and allocate machines effectively. The linear model shows significant relationships between cases and machines, cases and broken machines, and cases and store rating. Generally, stores with more machines present for more than one year have higher average case sales, a broken machine lowers average case sales, and a clean, modern store has higher average case sales. In short, when a store maintains its relationship with Frazil for more than one year, or even hosts multiple machines, average case sales can be several times higher, depending on the number of machines. Keeping stable relationships with the stores should promote sales. In addition, stores that have a clean environment and keep their machines in working condition will have higher average case sales. The company should develop stable partnerships with stores that have high hygiene standards and are attentive to their machines.